Introduction to Triton Programming: The Performance Paradox: Why Correct Code Can Be Slow

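The performance paradox states that even a mathematically perfect kernel (e.g., $out = x + y$) can run slower than a CPU loop if it fails to amortize the fixed costs of the GPU hardware. This often shows up as a launch tax.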

1. The Fallacy of "Correctness"

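Functional correctness is not a measure of efficiency. Even if your Triton code distributes work correctly across thousands of threads, the GPU stays underutilized when the total amount of work (N) is small. The hardware then spends more time on state transitions than on actual computation.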

2. The Pitfalls of Measuring from Python

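Benchmarking GPU code from Python with time.time() is risky. GPU calls are asynchronous: Python simply enqueues the command and moves on. Unless you call torch.cuda.synchronize(), you measure only the queuing time. Once you synchronize, the measurement also captures the host-to-device latency, which is often 10 times longer than the kernel execution itself.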
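
The pitfall can be reproduced without a GPU. The sketch below is a minimal illustration using only the Python standard library: `FakeDevice` is a hypothetical stand-in for an asynchronous GPU queue (it is not part of PyTorch or Triton), and `timed` is an illustrative helper. On a real GPU, `torch.cuda.synchronize` would play the role of `device.synchronize`.

```python
import threading
import time


class FakeDevice:
    """Hypothetical stand-in for a GPU queue: launch() returns immediately."""

    def __init__(self):
        self._pending = []

    def launch(self, seconds):
        # Start the "kernel" on a background thread and return at once,
        # mimicking an asynchronous GPU launch.
        worker = threading.Thread(target=time.sleep, args=(seconds,))
        worker.start()
        self._pending.append(worker)

    def synchronize(self):
        # Block until all queued work has finished. On a real GPU this is
        # the job of torch.cuda.synchronize().
        for worker in self._pending:
            worker.join()
        self._pending.clear()


def timed(fn, sync=None):
    """Wall-clock seconds for fn(); if sync is given, wait for queued work too."""
    if sync is not None:
        sync()  # drain earlier work so it is not charged to fn
    start = time.perf_counter()
    fn()
    if sync is not None:
        sync()  # wait for the work that fn enqueued
    return time.perf_counter() - start


device = FakeDevice()


def kernel():
    device.launch(0.05)  # simulated 50 ms kernel


naive = timed(kernel)  # measures only the enqueue
device.synchronize()   # clean up before the second measurement
honest = timed(kernel, sync=device.synchronize)

print(f"without sync: {naive * 1e3:.2f} ms, with sync: {honest * 1e3:.2f} ms")
```

The unsynchronized figure reflects only how long it took to hand the work off, while the synchronized figure includes the work itself. Passing the synchronization function as a parameter keeps the helper usable for CPU-only code as well.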

3. Latency vs. Throughput

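To overcome this paradox, you must supply enough work to "hide" the launch latency. This is the transition from a latency-bound regime (limited by the CPU-GPU bus) to a throughput-bound regime (limited by GPU memory bandwidth or compute capability).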
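
A back-of-the-envelope model makes the transition concrete. The sketch below assumes a fixed launch overhead of 10 microseconds and a memory bandwidth of 1 TB/s. Both numbers are illustrative assumptions, not measurements of any particular GPU. It models float32 vector addition as 12 bytes of traffic per element (two loads and one store).

```python
# Illustrative assumptions only; substitute measured values for real hardware.
LAUNCH_OVERHEAD_S = 10e-6   # assumed fixed cost per kernel launch, in seconds
BANDWIDTH_BYTES_S = 1e12    # assumed GPU memory bandwidth, in bytes per second
BYTES_PER_ELEMENT = 12      # float32 add: 2 loads + 1 store, 4 bytes each


def modeled_time(n):
    """Modeled seconds for one launch: fixed overhead plus memory traffic."""
    return LAUNCH_OVERHEAD_S + n * BYTES_PER_ELEMENT / BANDWIDTH_BYTES_S


def launch_fraction(n):
    """Share of the modeled time spent on the fixed launch overhead."""
    return LAUNCH_OVERHEAD_S / modeled_time(n)


def crossover_n():
    """Element count at which memory traffic time equals the launch overhead."""
    return LAUNCH_OVERHEAD_S * BANDWIDTH_BYTES_S / BYTES_PER_ELEMENT


for n in (256, 10**6, 10**8):
    print(f"N={n:>11,d}: launch overhead is {launch_fraction(n):.1%} of modeled time")
print(f"break-even N is roughly {crossover_n():,.0f} elements")
```

Under these assumed numbers the break-even point sits near 8.3 x 10^5 elements: well below it the launch is latency-bound, and well above it the fixed cost becomes a small fraction of the total.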

QUESTION 1

For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).

N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch

N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic

N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch

All are compute-bound.

QUESTION 2

In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?

Arithmetic Throughput

Memory Bandwidth

L1 Cache Size

QUESTION 3

What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?

The GPU and CPU always finish at the same time.

The CPU continues to the next line of code before the GPU kernel finishes.

The kernel runs faster on smaller GPUs.

Memory transfers are blocked by compute.

QUESTION 4

Why does $out = x + y$ exhibit low arithmetic intensity?

It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.

The addition operation is too complex for the ALUs.

It requires shared memory synchronization.

It only runs on one SM.

QUESTION 5

How can the 'Launch Tax' be amortized in a real-world application?

By calling the kernel more frequently with smaller data.

By increasing the workload per launch (e.g., larger N or batching).

By using 16-bit floats instead of 32-bit floats.

By disabling the L2 cache.